Instructions:
(15 points) In class, we studied the logit model with 2 classes. Now consider the multilogit model with $K$ classes. Let $\beta$ be the $(p+1)(K-1)$-vector consisting of all the coefficients. Define a suitably enlarged version of the input vector $x$ to accommodate this vectorized coefficient matrix. Derive the Newton-Raphson algorithm for maximizing the multinomial log-likelihood, and describe how you would implement the algorithm (e.g., you can write pseudo-code).
From Discussion:
$$
\begin{aligned}
y &\in \{0,1\} \\
P(Y=y|X) &= P(Y=1|X)^{y}\, P(Y=0|X)^{1-y} \\
\log P(Y=y|X) &= y\log P(Y=1|X) + (1-y)\log P(Y=0|X) \\
&= y\log\frac{P(Y=1|X)}{P(Y=0|X)} + \log P(Y=0|X) \\
\log\frac{P(Y=1|X)}{P(Y=0|X)} &= \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p = x^{T}\beta \\
P(Y=1|X) &= \frac{e^{x^{T}\beta}}{1+e^{x^{T}\beta}}, \qquad
P(Y=0|X) = \frac{1}{1+e^{x^{T}\beta}}
\end{aligned}
$$
with
$$
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix},
\qquad
x = \begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_p \end{bmatrix},
$$
so that
$$
\log P(Y=y|x) = y\,x^{T}\beta + \log P(Y=0|X) = y\,x^{T}\beta - \log\!\left(1+e^{x^{T}\beta}\right).
$$
For data $(x_i, y_i)$, $i = 1,\ldots,n$:
$$
\sum_{i=1}^{n} \log P(y_i|x_i) = \sum_{i=1}^{n}\left[ y_i x_i^{T}\beta - \log\!\left(1+e^{x_i^{T}\beta}\right)\right].
$$
We can write the likelihood as follows, with $y_j$ the indicator that the observation is in class $j$ and class $K$ taken as the reference:
$$
\begin{aligned}
L &= \prod_{i=1}^{n} p(y_i|x_i) \\
p(y|x) &= P(Y=1|X)^{y_1} \cdots P(Y=K-1|X)^{y_{K-1}}\left[1 - \sum_{j=1}^{K-1} P(Y=j|X)\right]^{1-\sum_{j=1}^{K-1} y_j} \\
\ell &= \sum_{i=1}^{n} \log p(y_i|x_i) \\
\log p(y_i|x_i) &= y_1\log P(Y=1|X) + \ldots + y_{K-1}\log P(Y=K-1|X) + \Big(1-\textstyle\sum_{j=1}^{K-1} y_j\Big)\log P(Y=K|X) \\
&= y_1\log\frac{P(Y=1|X)}{P(Y=K|X)} + \ldots + y_{K-1}\log\frac{P(Y=K-1|X)}{P(Y=K|X)} + \log P(Y=K|X) \\
&= \log P(Y=K|X) + y_1(\beta_{01}+\beta_1^{T}x) + \ldots + y_{K-1}(\beta_{0(K-1)}+\beta_{K-1}^{T}x) \\
\ell &= \sum_{i=1}^{n}\Big[\log P(Y=K|X=x_i) + \sum_{j=1}^{K-1} y_{ij}\, x_i^{T}\beta_j\Big] \\
&= \sum_{i=1}^{n}\Big[-\log\Big(1+\sum_{j=1}^{K-1} e^{x_i^{T}\beta_j}\Big) + \sum_{j=1}^{K-1} y_{ij}\, x_i^{T}\beta_j\Big]
\end{aligned}
$$
where in the last two lines $x_i$ includes the leading 1, so $x_i^{T}\beta_j = \beta_{0j} + \beta_j^{T}x_i$, and $y_{ij}$ is the class-$j$ indicator for observation $i$.
We now maximize this multinomial log-likelihood with the Newton-Raphson algorithm.
$$
\begin{aligned}
\frac{\partial \ell}{\partial \beta_j} &= \sum_{i=1}^{n}\left(y_{ij} - \frac{e^{x_i^{T}\beta_j}}{1+\sum_{k=1}^{K-1} e^{x_i^{T}\beta_k}}\right) x_i
= \sum_{i=1}^{n}\big(y_{ij} - P(Y=j|X=x_i)\big)\, x_i \\
\frac{\partial^2 \ell}{\partial \beta_j\, \partial \beta_k^{T}} &= -\sum_{i=1}^{n} p_{ij}\left(\delta_{jk} - p_{ik}\right) x_i x_i^{T},
\qquad p_{ij} = P(Y=j|X=x_i),
\end{aligned}
$$
where $\delta_{jk}=1$ if $j=k$ and $0$ otherwise. Stacking the $\beta_j$ into the single vector $\beta$, these blocks give the full gradient and Hessian, which plug into the Newton-Raphson updating step:
$$
\beta^{\text{new}} = \beta^{\text{old}} - \left(\frac{\partial^2 \ell}{\partial \beta\, \partial \beta^{T}}\right)^{-1} \frac{\partial \ell}{\partial \beta}
$$

Natural language processing (NLP) is a branch of artificial intelligence that gives computers the ability to understand text and spoken words in much the same way human beings can.
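For the implementation asked for in the problem, the Newton-Raphson iteration for the multinomial logit can be sketched in NumPy as below. This is a minimal sketch: the helper name `multilogit_newton` and the small ridge term (added to keep the Hessian invertible) are my own additions, and class $K$ is taken as the reference with $Y$ an $n \times (K-1)$ indicator matrix.

```python
import numpy as np

def multilogit_newton(X, Y, n_iter=20, ridge=1e-8):
    """Newton-Raphson for the multinomial logit.

    X : (n, p+1) design matrix with a leading column of 1s.
    Y : (n, K-1) indicator matrix; an all-zero row encodes the reference class K.
    Returns beta as a (p+1, K-1) matrix (column j holds beta_j).
    """
    n, p1 = X.shape
    Km1 = Y.shape[1]
    beta = np.zeros((p1, Km1))
    for _ in range(n_iter):
        E = np.exp(X @ beta)                           # (n, K-1)
        P = E / (1.0 + E.sum(axis=1, keepdims=True))   # P(Y=j | x_i)
        # Score: g_j = sum_i (y_ij - p_ij) x_i, stacked column-wise into one vector
        g = (X.T @ (Y - P)).ravel(order="F")
        # Hessian blocks: H_jk = -sum_i p_ij (delta_jk - p_ik) x_i x_i^T
        H = np.zeros((p1 * Km1, p1 * Km1))
        for j in range(Km1):
            for k in range(Km1):
                w = P[:, j] * ((j == k) - P[:, k])
                H[j*p1:(j+1)*p1, k*p1:(k+1)*p1] = -(X * w[:, None]).T @ X
        # Newton step: beta_new = beta_old - H^{-1} g
        step = np.linalg.solve(H - ridge * np.eye(p1 * Km1), g)
        beta -= step.reshape((p1, Km1), order="F")
    return beta
```

In practice one would also monitor the change in the log-likelihood (or the norm of the gradient) and stop once it falls below a tolerance, rather than running a fixed number of iterations.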
In Python, text data can be converted into vector data through a vectorization operation.
Two vectorizer packages in Python are sklearn.feature_extraction.text.CountVectorizer and sklearn.feature_extraction.text.TfidfVectorizer. A corpus is a collection of documents and the dictionary is all of the words in the corpus. A simple vectorizer will let $X_{i,j}$ be the number of times the $j$th word is in the $i$th document.
The bag-of-words model is one of the most popular models in NLP. It treats each document as a multiset of words, ignoring the order of those words.
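As a toy illustration of the $X_{i,j}$ counts described above (pure Python with a made-up two-document corpus, rather than the sklearn vectorizers):

```python
# Bag-of-words by hand: X[i][j] counts how often the j-th dictionary word
# appears in the i-th document; word order is discarded.
corpus = ["the cat sat", "the cat ate the fish"]
dictionary = sorted({word for doc in corpus for word in doc.split()})
X = [[doc.split().count(word) for word in dictionary] for doc in corpus]
print(dictionary)  # ['ate', 'cat', 'fish', 'sat', 'the']
print(X)           # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```

CountVectorizer does essentially this (plus tokenization options and sparse storage), and TfidfVectorizer additionally reweights the counts by inverse document frequency.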
In this exercise, you will learn how to classify text using SVM. The dataset includes two CSV files (Corona_NLP_train.csv and Corona_NLP_test.csv) that contain IDs and sentiment scores of tweets related to the COVID-19 pandemic. The real-time Twitter feed is monitored for coronavirus-related tweets using 90+ different keywords and hashtags that are commonly used while referencing the pandemic. The oldest tweets in this dataset date back to October 01, 2019.
The training dataset contains five columns:
(extremely positive, positive, negative, extremely negative, neutral). The task is to predict sentiment based on the original tweet. Here, we merge extremely positive into positive and extremely negative into negative, so the sentiment takes three labels.
Your goal is to apply SVM to predict the three labels based on OriginalTweet. Indeed, one can view this as a classification problem with three labels.
I already attached the file dataprocessing.ipynb for processing the data. The code is directly copied from this website.
Please answer the following questions:
Fit the model on the TRAIN split (Corona_NLP_train.csv) and predict on the TEST split (Corona_NLP_test.csv). Plot your ROC and PR (precision-recall) curves for predicting positive (versus everything else); use the linear kernel and set the C parameter to 1. Do the same for predicting the negative label versus everything else. Please write the code for generating the ROC curve yourself. Note that svm.SVC fits SVM for multi-class classification. Also note: precision is the number of true positives divided by the sum of true positives and false positives; the PR curve plots precision against recall and describes how good a model is at predicting the positive class.
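Since the exercise asks for ROC code written by hand, here is a minimal NumPy sketch of how the (FPR, TPR) points can be computed by sweeping a threshold over the scores. The helper name `roc_points` is my own; for simplicity it keeps one point per observation (sklearn's `roc_curve` additionally collapses tied scores into a single threshold).

```python
import numpy as np

def roc_points(y_true, scores):
    """Sweep a threshold over the sorted scores; at each cut, record the
    false-positive rate and true-positive rate. y_true is 0/1."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores, dtype=float))  # descending by score
    y_sorted = y_true[order]
    tps = np.cumsum(y_sorted)        # true positives captured so far
    fps = np.cumsum(1 - y_sorted)    # false positives captured so far
    tpr = tps / y_true.sum()
    fpr = fps / (len(y_true) - y_true.sum())
    return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])
```

AUC can then be estimated by trapezoidal integration of the returned points.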
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Sklearn optimization
# !pip install scikit-learn-intelex
from sklearnex import patch_sklearn
patch_sklearn()
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pickle
import time
import re
from IPython.display import display
from sklearn.metrics import roc_curve  # explicit import instead of a wildcard
# Plotting
import plotly.express as px
import plotly.io as pio
# This ensures Plotly output works in multiple places:
# plotly_mimetype: VS Code notebook UI
# notebook: "Jupyter: Export to HTML" command in VS Code
# See https://plotly.com/python/renderers/#multiple-renderers
pio.renderers.default = "plotly_mimetype+notebook"
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
# Preprocessing
"""
# Loading in the data
trainSet =pd.read_csv("Corona_NLP_train.csv", encoding="latin1")
testSet = pd.read_csv("Corona_NLP_test.csv", encoding = "latin1")
unrelevant_features = ["UserName", "ScreenName", "Location", "TweetAt"]
trainSet.drop(unrelevant_features, inplace = True, axis = 1)
testSet.drop(unrelevant_features, inplace = True, axis = 1)
#display(trainSet.head())
trainSet.Sentiment = trainSet.Sentiment.replace("Extremely Postive", "Positive")
trainSet.Sentiment = trainSet.Sentiment.replace("Extremely Negative", "Negative")
#display(trainSet.head())
testSet.Sentiment = testSet.Sentiment.replace("Extremely Positive", "Positive")
testSet.Sentiment = testSet.Sentiment.replace("Extremely Negative", "Negative")
# Convert negatives as 0, neutrals as 1, positives as 2,
mapping = {"Negative": 0, "Neutral": 1, "Positive":2}
trainSet.Sentiment = trainSet.Sentiment.replace(mapping)
testSet.Sentiment = testSet.Sentiment.replace(mapping)
data = pd.concat([trainSet, testSet])
display(data.info())
display(data.head()) """
train_set = pd.read_csv('Corona_NLP_train.csv',encoding="latin1") # do not forget to change the path
test_set = pd.read_csv('Corona_NLP_test.csv',encoding="latin1")
# drop irrelevant features
irrelevant_features = ["UserName","ScreenName","Location","TweetAt"]
train_set.drop(irrelevant_features,inplace=True,axis=1)
test_set.drop(irrelevant_features,inplace=True,axis=1)
display(train_set.head())
# split data based on sentiment values: positive, neutral or negative.
# Extremely Positive is merged into Positive, and Extremely Negative into Negative
display(train_set["Sentiment"].value_counts())
positives = train_set[(train_set["Sentiment"] == "Positive") | (train_set["Sentiment"] == "Extremely Positive")]
positives_test = test_set[(test_set["Sentiment"] == "Positive") | (test_set["Sentiment"] == "Extremely Positive")]
print(positives["Sentiment"].value_counts())
display(positives.head())
negatives = train_set[(train_set["Sentiment"] == "Negative") | (train_set["Sentiment"] == "Extremely Negative")]
negatives_test = test_set[(test_set["Sentiment"] == "Negative") | (test_set["Sentiment"] == "Extremely Negative")]
#print(negatives["Sentiment"].value_counts())
#display(negatives.head())
neutrals = train_set[train_set["Sentiment"] == "Neutral"]
neutrals_test = test_set[test_set["Sentiment"] == "Neutral"]
#print(neutrals["Sentiment"].value_counts())
#display(neutrals.head())
# Convert labels into integers
# convert negatives as 0
# neutrals as 1
# and positives as 2.
import warnings as wrn
wrn.filterwarnings('ignore')  # silence SettingWithCopyWarning from assigning to DataFrame slices
negatives["Sentiment"] = 0
negatives_test["Sentiment"] = 0
positives["Sentiment"] = 2
positives_test["Sentiment"] = 2
neutrals["Sentiment"] = 1
neutrals_test["Sentiment"] = 1
# concatenate train and test first, will split them after processing.
data = pd.concat([positives,
positives_test,
neutrals,
neutrals_test,
negatives,
negatives_test
],axis=0)
data.reset_index(inplace=True)
#print(data.info())
#print(data.head())
| | OriginalTweet | Sentiment |
|---|---|---|
| 0 | @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... | Neutral |
| 1 | advice Talk to your neighbours family to excha... | Positive |
| 2 | Coronavirus Australia: Woolworths to give elde... | Positive |
| 3 | My food stock is not the only one which is emp... | Positive |
| 4 | Me, ready to go at supermarket during the #COV... | Extremely Negative |
Sentiment
Positive              11422
Negative               9917
Neutral                7713
Extremely Positive     6624
Extremely Negative     5481
Name: count, dtype: int64

Sentiment
Positive              11422
Extremely Positive     6624
Name: count, dtype: int64
| | OriginalTweet | Sentiment |
|---|---|---|
| 1 | advice Talk to your neighbours family to excha... | Positive |
| 2 | Coronavirus Australia: Woolworths to give elde... | Positive |
| 3 | My food stock is not the only one which is emp... | Positive |
| 5 | As news of the regionÂs first confirmed COVID... | Positive |
| 6 | Cashier at grocery store was sharing his insig... | Positive |
#nltk.download('omw-1.4')
#nltk.download('stopwords')
#nltk.download('punkt')
#nltk.download('wordnet')
cleanedData = []
lemma = WordNetLemmatizer()
swords = stopwords.words("english")
for text in data["OriginalTweet"]:
# Cleaning links
text = re.sub(r'http\S+', '', text)
# Cleaning everything except alphabetical and numerical characters
text = re.sub("[^a-zA-Z0-9]"," ",text)
# Tokenizing and lemmatizing
text = nltk.word_tokenize(text.lower())
text = [lemma.lemmatize(word) for word in text]
# Removing stopwords
text = [word for word in text if word not in swords]
# Joining
text = " ".join(text)
cleanedData.append(text)
# check the output text
for i in range(0,5):
print(cleanedData[i],end="\n\n")
advice talk neighbour family exchange phone number create contact list phone number neighbour school employer chemist gp set online shopping account po adequate supply regular med order

coronavirus australia woolworth give elderly disabled dedicated shopping hour amid covid 19 outbreak

food stock one empty please panic enough food everyone take need stay calm stay safe covid19france covid 19 covid19 coronavirus confinement confinementotal confinementgeneral

news region first confirmed covid 19 case came sullivan county last week people flocked area store purchase cleaning supply hand sanitizer food toilet paper good tim dodson report

cashier grocery store wa sharing insight covid 19 prove credibility commented civics class know talking
# create the bag of words
vectorizer = CountVectorizer(max_features=10000)
BOW = vectorizer.fit_transform(cleanedData)
# split the dataset into training and test
""" x_train,x_test,y_train,y_test = train_test_split(BOW,np.asarray(data["Sentiment"]))
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape) """
from sklearn.svm import SVC
start_time = time.time()
model = SVC(C = 1, kernel="linear", probability=True)
#model.fit(x_train,y_train)
end_time = time.time()
process_time = round(end_time-start_time,2)
#print("Fitting SVC took {} seconds".format(process_time))
#model.predict(x_test)
import copy
# Generate data that is all positive
data_pos = copy.deepcopy(data) # .Sentiment.replace(1, 0)
data_pos.Sentiment = data_pos.Sentiment.replace(1, 0) # 0 is negative or neutral
data_pos.Sentiment = data_pos.Sentiment.replace(2, 1) # 1 is positive
# Split for positive data
x_train_pos,x_test_pos,y_train_pos,y_test_pos = train_test_split(BOW,np.asarray(data_pos["Sentiment"]))
model_pos = SVC(C = 1, kernel="linear", probability=True)
model_pos.fit(x_train_pos, y_train_pos)
SVC(C=1, kernel='linear', probability=True)
model_pos_pred = model_pos.predict(x_test_pos)
unique, counts = np.unique(model_pos_pred, return_counts=True)
dict(zip(unique, counts))
# Find probabilities
model_pos_probs = model_pos.predict_proba(x_test_pos)
# https://datascience.stackexchange.com/questions/22762/understanding-predict-proba-from-multioutputclassifier
fpr_pos, tpr_pos, thresh_pos = roc_curve(y_test_pos, model_pos_probs[:, 1])
fig_pos = px.area(x = fpr_pos, y = tpr_pos, title = 'ROC Curve for Positive Class vs Everything Else')
fig_pos.add_shape(type = 'line', line = dict(dash='dash'), x0=0, x1=1, y0=0, y1=1)
fig_pos
# Create data negative vs everything else
data_neg = copy.deepcopy(data)
data_neg.Sentiment = data_neg.Sentiment.replace(2, 1) # 1 is positive or neutral, 0 is negative
model_neg = SVC(C = 1, kernel="linear", probability=True)
# Split for negative data
x_train_neg,x_test_neg,y_train_neg,y_test_neg = train_test_split(BOW,np.asarray(data_neg["Sentiment"]))
model_neg.fit(x_train_neg, y_train_neg)
SVC(C=1, kernel='linear', probability=True)
model_neg_pred = model_neg.predict(x_test_neg)
unique, counts = np.unique(model_neg_pred, return_counts=True)
dict(zip(unique, counts))
{0: 4151, 1: 7088}
model_neg_probs = model_neg.predict_proba(x_test_neg)
# https://datascience.stackexchange.com/questions/22762/understanding-predict-proba-from-multioutputclassifier
# Since we are predicting the negative class (label 0), score with P(class 0)
# and tell roc_curve that 0 is the positive label
fpr_neg, tpr_neg, thresh_neg = roc_curve(y_test_neg, model_neg_probs[:, 0], pos_label=0)
fig_neg = px.area(x = fpr_neg, y = tpr_neg, title = 'ROC Curve for Negative Class vs Everything Else')
fig_neg.add_shape(type = 'line', line = dict(dash='dash'), x0=0, x1=1, y0=0, y1=1)
fig_neg.show()
From the package sklearn, svm.SVC can be used for both binary and multiclass classification. The multiclass case (in our case 3 classes instead of two) is handled with a one-vs-one approach. An alternative is LinearSVC, which supports multi_class='crammer_singer' (a joint multiclass formulation); it gave similar results for a significantly reduced run time.
# Testing positive models with C = 0.01, 0.5, 1.25, 2
model_posp25 = SVC(C = 0.01, kernel="linear", probability=True)
model_posp25.fit(x_train_pos, y_train_pos)
SVC(C=0.01, kernel='linear', probability=True)
model_pos_probs25 = model_posp25.predict_proba(x_test_pos)
fpr_pos25, tpr_pos25, thresh_pos = roc_curve(y_test_pos, model_pos_probs25[:, 1])
fig_pos25 = px.area(x = fpr_pos25, y = tpr_pos25, title = 'ROC Curve for Positive Class vs Everything Else, C = 0.01')
fig_pos25.add_shape(type = 'line', line = dict(dash='dash'), x0=0, x1=1, y0=0, y1=1)
fig_pos25.show()
# C = 0.5
model_posp75 = SVC(C = 0.5, kernel="linear", probability=True)
model_posp75.fit(x_train_pos, y_train_pos)
SVC(C=0.5, kernel='linear', probability=True)
model_pos_probs75 = model_posp75.predict_proba(x_test_pos)
fpr_pos75, tpr_pos75, thresh_pos = roc_curve(y_test_pos, model_pos_probs75[:, 1])
fig_pos75 = px.area(x = fpr_pos75, y = tpr_pos75, title = 'ROC Curve for Positive Class vs Everything Else, C = 0.5')
fig_pos75.add_shape(type = 'line', line = dict(dash='dash'), x0=0, x1=1, y0=0, y1=1)
fig_pos75.add_scatter(x=fpr_pos25, y=tpr_pos25)
fig_pos75.show()
# This works, but a better approach would be to make a wide pandas data frame,
# convert it to long, and then make one area plot where color encodes the value of C
# C = 2
model_pos125 = SVC(C = 2, kernel="linear", probability=True)
model_pos125.fit(x_train_pos, y_train_pos)
SVC(C=2, kernel='linear', probability=True)
model_pos_probs125 = model_pos125.predict_proba(x_test_pos)
fpr_pos125, tpr_pos125, thresh_pos = roc_curve(y_test_pos, model_pos_probs125[:, 1])
# C = 1.25
model_pos175 = SVC(C = 1.25, kernel="linear", probability=True)
model_pos175.fit(x_train_pos, y_train_pos)
SVC(C=1.25, kernel='linear', probability=True)
model_pos_probs175 = model_pos175.predict_proba(x_test_pos)
fpr_pos175, tpr_pos175, thresh_pos = roc_curve(y_test_pos, model_pos_probs175[:, 1])
fig_posAll = px.area(x = fpr_pos25, y = tpr_pos25, title = 'ROC Curves for Positive Class vs Everything Else, C = 0.01, 0.5, 1.25, 2')
fig_posAll.add_shape(type = 'line', line = dict(dash='dash'), x0=0, x1=1, y0=0, y1=1)
fig_posAll.add_scatter(x=fpr_pos75, y=tpr_pos75)
fig_posAll.add_scatter(x=fpr_pos125, y=tpr_pos125)
fig_posAll.add_scatter(x=fpr_pos175, y=tpr_pos175)
#fig_posAll.update_traces(name = ["C = 0.75", "C = 2", "C = 3"], showlegend = True)
fig_posAll.show()
# NOW BUILDING THE NEGATIVE MODELS
# with C = 0.01, 0.75, 1.5, 2
# 0.01
model_neg01 = SVC(C = 0.01, kernel="linear", probability=True)
model_neg01.fit(x_train_neg, y_train_neg)
model_neg_probs01 = model_neg01.predict_proba(x_test_neg)
fpr_neg01, tpr_neg01, thresh_pos = roc_curve(y_test_neg, model_neg_probs01[:, 1])
# 0.75
model_neg75 = SVC(C = 0.75, kernel="linear", probability=True)
model_neg75.fit(x_train_neg, y_train_neg)
model_neg_probs75 = model_neg75.predict_proba(x_test_neg)
fpr_neg75, tpr_neg75, thresh_pos = roc_curve(y_test_neg, model_neg_probs75[:, 1])
# 1.5
model_neg15 = SVC(C = 1.5, kernel="linear", probability=True)
model_neg15.fit(x_train_neg, y_train_neg)
model_neg_probs15 = model_neg15.predict_proba(x_test_neg)
fpr_neg15, tpr_neg15, thresh_pos = roc_curve(y_test_neg, model_neg_probs15[:, 1])
# 2
model_neg2 = SVC(C = 2, kernel="linear", probability=True)
model_neg2.fit(x_train_neg, y_train_neg)
model_neg_probs2 = model_neg2.predict_proba(x_test_neg)
fpr_neg2, tpr_neg2, thresh_pos = roc_curve(y_test_neg, model_neg_probs2[:, 1])
fig_negAll = px.area(x = fpr_neg01, y = tpr_neg01, title = 'ROC Curves for Negative Class vs Everything Else, C = 0.01, 0.75, 1.5, 2')
fig_negAll.add_shape(type = 'line', line = dict(dash='dash'), x0=0, x1=1, y0=0, y1=1)
fig_negAll.add_scatter(x=fpr_neg75, y=tpr_neg75)
fig_negAll.add_scatter(x=fpr_neg15, y=tpr_neg15)
fig_negAll.add_scatter(x=fpr_neg2, y=tpr_neg2)
#fig_negAll.update_traces(name = ["C = 0.75", "C = 1.5", "C = 2"], showlegend = True)
fig_negAll.show()
We can see that as the value of C increases, the ROC curve rises toward the top-left corner more quickly (higher AUC). However, increasing C also means a longer runtime for fitting the classifier.
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression(max_iter=1000).fit(x_train_pos, y_train_pos)  # raise max_iter so the solver converges on the high-dimensional BOW features
log_predict = log_model.predict_proba(x_test_pos)
fpr_log, tpr_log, thresh_log = roc_curve(y_test_pos, log_predict[:, 1])
fig_log = px.area(x = fpr_log, y = tpr_log, title = f'ROC Curve for Positive Class vs Everything Else For Logistic Model')
fig_log.add_shape(type = 'line', line = dict(dash='dash'), x0=0, x1=1, y0=0, y1=1)
fig_log.add_scatter(x=fpr_pos75, y=tpr_pos75)
fig_log.show()
# Negative Class
log_model = LogisticRegression(max_iter=1000).fit(x_train_neg, y_train_neg)
log_predict = log_model.predict_proba(x_test_neg)
fpr_log_neg, tpr_log_neg, thresh_log = roc_curve(y_test_neg, log_predict[:, 1])
fig_log_neg = px.area(x = fpr_log_neg, y = tpr_log_neg, title = f'ROC Curve for Negative Class vs Everything Else For Logistic Model')
fig_log_neg.add_shape(type = 'line', line = dict(dash='dash'), x0=0, x1=1, y0=0, y1=1)
fig_log_neg.add_scatter(x=fpr_neg75, y=tpr_neg75)
fig_log_neg.show()
From the plot of the ROC curves, we can see that the logistic regression achieves a higher ROC curve (larger AUC) than the SVM with C = 0.5 (shown in red).
Load the poses.csv dataset, which is a concatenation of other datasets to form a larger dataset. The task column in the dataset contains six poses: sitting, lying, walking, standing, cycling, bending. I want you to act as if the dataset is from the same experiment. You need to open the file and take a look at the dataset first. Combine bending1 and bending2 together.
.components_)?

from sklearn.decomposition import PCA
poses = pd.read_csv("poses.csv")
# Combine bending1 and bending2
poses.task = poses.task.replace(["bending1","bending2"], "bending")
poses.dropna(inplace=True)
task = poses.task.tail(-1)
# Drop unnamed and time
# inplace =True saves us having to redeclare the data frame
poses.drop(columns = ["Unnamed: 0", "# Columns: time", "filename", "task"], inplace=True)
variables = list(poses)
# Drop all rows with NA
poses.dropna(inplace=True)
poses.iloc[:,[0,1,2,3,4,5]] = poses.iloc[:,[0,1,2,3,4,5]].diff()
#poses = pd.get_dummies(poses)
# drop first row and all nan values
poses = poses.tail(-1)
scaler = StandardScaler()
#scaler.fit_transform(poses)
poses =scaler.fit_transform(poses)
display(poses)
pca_poses = PCA(n_components=2)
poses_fit = pca_poses.fit_transform(poses)
display(px.scatter(x = poses_fit[:,0], y = poses_fit[:,1], title = "Two Principal Components for Poses Data"))
pd.DataFrame(data={'PC1':pca_poses.components_[0,:], 'PC2':pca_poses.components_[1,:]}, index = variables)
array([[-7.19160928e-02, 2.27858399e-01, 3.44301243e+00,
-1.77233671e+00, -2.38669530e+00, 3.92144609e-01],
[-2.30190045e-02, 2.11865769e-02, -4.19031522e+00,
2.33757686e+00, 8.82863637e-02, 8.67858031e-01],
[-4.80350595e-01, 1.85464179e-01, 1.14991536e+00,
-2.88971947e+00, 2.03291482e+00, -1.11856694e+00],
...,
[ 7.18988728e-02, -1.74886690e-01, 2.58155383e-01,
1.23303758e+00, -1.05838645e-04, 2.37859174e-01],
[ 7.18988728e-02, -3.71054755e-02, 9.46773124e-01,
-2.76071305e-01, -3.53674648e-01, -2.37854248e-01],
[-1.43823576e-01, 2.11960567e-01, 1.72146808e+00,
-9.56775740e-01, -1.76890243e-01, 1.35002218e-01]])
| | PC1 | PC2 |
|---|---|---|
| avg_rss12 | -0.134953 | 0.095090 |
| var_rss12 | -0.236896 | -0.107579 |
| avg_rss13 | 0.591286 | 0.346090 |
| var_rss13 | -0.585028 | -0.340849 |
| avg_rss23 | 0.353578 | -0.611760 |
| var_rss23 | -0.329809 | 0.607600 |
From the table of principal-component loadings, we can see that the first principal component loads most heavily on avg_rss13 and var_rss13 (with opposite signs), followed by avg_rss23 and var_rss23. The second component is dominated by avg_rss23 and var_rss23.
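To connect the loadings table back to what `pca.components_` contains, here is a small NumPy sketch of PCA via the SVD (the helper name `pca_loadings` is my own): the rows of $V^T$ for the centered data are exactly the loading vectors reported above, and the squared singular values give the variance each component explains.

```python
import numpy as np

def pca_loadings(X, k=2):
    """PCA via SVD of the centered data matrix.
    Returns (loadings, variances): the first k loading vectors (rows, as in
    sklearn's pca.components_) and the variance explained by each of them."""
    Xc = X - X.mean(axis=0)               # center each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = s**2 / (len(X) - 1)       # sample variance along each component
    return Vt[:k], variances[:k]
```

On data with one dominant direction, the first loading vector points along that direction, which is why large entries in PC1 flag the most influential variables.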
Compute the confusion matrix (sklearn.metrics.confusion_matrix) of the clusters against the 'task'. Is there a clear mapping from clusters to task?

from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
poses_train, poses_test, task_train, task_test = train_test_split(poses, task, test_size = 0.3, random_state = 0)
# display(poses)
kmeans = KMeans(n_clusters = 6, random_state = 0, n_init='auto').fit(poses_train) #, task_train)
#kmeans.predict(poses_test)
#task.unique()
# Map task names to integers; note that KMeans cluster indices are arbitrary,
# so the column names below are just positional, not true task predictions
task_test = task_test.replace(task.unique(), [0,1,2,3,4,5])
pd.DataFrame(confusion_matrix(y_pred=kmeans.predict(poses_test), y_true = task_test), index = task.unique(), columns = task.unique())
| | sitting | lying | walking | standing | cycling | bending |
|---|---|---|---|---|---|---|
| sitting | 1764 | 104 | 81 | 120 | 5 | 4 |
| lying | 2091 | 41 | 35 | 44 | 1 | 5 |
| walking | 283 | 402 | 351 | 343 | 427 | 352 |
| standing | 1889 | 90 | 71 | 92 | 1 | 4 |
| cycling | 311 | 509 | 398 | 419 | 151 | 345 |
| bending | 1533 | 85 | 38 | 116 | 6 | 17 |
We can see from the confusion matrix that there is not a clear mapping from clusters to tasks: most observations land in the first cluster regardless of their true task, so the clusters do not line up with the activity labels.
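One way to quantify "is there a clear mapping" is cluster purity: assign each cluster its majority task and score the fraction of points that match. A small sketch (the helper name `cluster_purity` is my own, not part of sklearn; it assumes integer-coded labels):

```python
import numpy as np

def cluster_purity(clusters, labels):
    """Fraction of points whose cluster's majority label equals their own label.
    1.0 means each cluster maps cleanly onto a single task."""
    clusters = np.asarray(clusters)
    labels = np.asarray(labels)
    matched = 0
    for c in np.unique(clusters):
        members = labels[clusters == c]
        matched += np.bincount(members).max()  # size of the majority label in this cluster
    return matched / len(labels)
```

Applied to `kmeans.predict(poses_test)` and the integer-coded `task_test`, a purity close to the frequency of the most common task would confirm that the clustering carries little information about the activities.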